```{r}
#| label: setup
library(tidyverse)
theme_set(theme_classic(base_size = 16))
```
## Learning objectives

* Understand **probability** as a measure of **uncertainty**
* **Summarize** probabilities with *probability distributions*
* Think about **uncertainty** in sampling
* **Summarize** uncertainty in sampling: **Sampling distribution**
## Mathematically-derived sampling distribution

* Many combinations of **population distribution** and **test statistic** have *mathematically-derived* sampling distributions
    * For example, with a **normally distributed** population ($\mathcal{N}(\mu, \sigma^2)$), the sampling distribution of the **mean** is $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
    * In this case, the *population* distribution and *sampling* distribution are the **same distribution**: Normal
* The variance (or standard deviation) of the sampling distribution **decreases** as $n$ increases: a *larger sample* leads to a smaller, *more precise* standard error (illustrated in the sketch below)
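A quick numeric illustration of that last point (a sketch added here, not part of the original lab code): for a $\mathcal{N}(0, 1)$ population, the theoretical standard error $\sigma / \sqrt{n}$ shrinks as $n$ grows.

```{r}
# Theoretical standard error sigma / sqrt(n) for a N(0, 1) population,
# evaluated at the three sample sizes used later in this lab
sigma <- 1
n <- c(10, 50, 100)
sigma / sqrt(n)
```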
::: {.callout-note}
* There is not a mathematically-derived sampling distribution for every population distribution and test statistic
* If you're dealing with *unknown or unusual distributions or test statistics*, your only option may be *computational methods*
:::
## Computationally-derived sampling distribution

### Getting started

* Set a **random number seed**
    * If you don't do this, every time you run the code, you'll get different numbers
    * Setting a random seed makes your simulation *replicable*
    * (I normally set a seed at the start of each document, but I moved it down here because it's very important in this lab.)
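The corresponding setup chunk, reproduced from the lab source at the end of this document, sets the seed and loads the [**infer** package](https://infer.tidymodels.org/), whose `rep_sample_n()` function does the repeated sampling (load it once per document; install it first if needed):

```{r}
set.seed(12345)
library(infer)
```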
### Poisson population: $population \sim Poisson(1.5)$

#### Compare estimates

(The chunks that create `poisson_pop` and draw the repeated samples appear in the full source at the end of this document.)
* Compare means, standard deviations, and variances for each sample size

```{r}
mean(pois_means10$x_bar)
#> [1] 1.49989
mean(pois_means50$x_bar)
#> [1] 1.502572
mean(pois_means100$x_bar)
#> [1] 1.503961
```

```{r}
sd(pois_means10$x_bar)
#> [1] 0.3863003
sd(pois_means50$x_bar)
#> [1] 0.1733869
sd(pois_means100$x_bar)
#> [1] 0.1214733
```

```{r}
var(pois_means10$x_bar)
#> [1] 0.1492279
var(pois_means50$x_bar)
#> [1] 0.03006303
var(pois_means100$x_bar)
#> [1] 0.01475576
```
* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution **for a normally distributed population**? (A quick check follows below.)
    * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
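One way to sanity-check these numbers (a sketch added here, not part of the original lab): the population is $Poisson(1.5)$, so its variance is $\sigma^2 = \lambda = 1.5$ and the theoretical standard error is $\sqrt{1.5 / n}$; these values should be close to the simulated `sd()` results above.

```{r}
# Theoretical standard errors sqrt(lambda / n) for a Poisson(1.5) population;
# for a Poisson distribution, lambda is both the mean and the variance
lambda <- 1.5
n <- c(10, 50, 100)
sqrt(lambda / n)
```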
## Summarize results

* In general, how does **sample size** impact the **standard error**?
    * i.e., the standard deviation of the sampling distribution?
* Does it seem to matter what the population distribution is?
    * Based on: Visual inspection of the distribution
    * Based on: Estimates of standard error (the theoretical values are tabulated in the sketch below)
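As a closing reference point (a sketch added here, not part of the original lab), the theoretical standard errors for all three populations used in this lab can be tabulated side by side; the population parameters ($\sigma^2 = 1$, $p = 0.3$, $\lambda = 1.5$) are taken from the lab itself.

```{r}
# Theoretical standard errors for each population and sample size in this lab
n <- c(10, 50, 100)
data.frame(
  n = n,
  normal    = sqrt(1 / n),          # sigma^2 = 1
  bernoulli = sqrt(0.3 * 0.7 / n),  # sigma^2 = p * (1 - p)
  poisson   = sqrt(1.5 / n)         # sigma^2 = lambda
)
```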
## Source Code
---title: "BTS 510 Lab 3"format: html: embed-resources: true self-contained-math: true html-math-method: katex number-sections: true toc: true code-tools: true code-block-bg: true code-block-border-left: "#31BAE9"---```{r}#| label: setuplibrary(tidyverse)theme_set(theme_classic(base_size =16))```## Learning objectives* Understand **probability** as a measure of **uncertainty*** **Summarize** probabilities with *probability distributions* * Think about **uncertainty** in sampling* **Summarize** uncertainty in sampling: **Sampling distribution**## Mathematically-derived sampling distribution* Many combinations of **population distribution** and **test statistic** have *mathematically-derived* sampling distributions * For example, with a **normally distributed** population ($\mathcal{N}(\mu, \sigma^2)$), the sampling distribution of the **mean** is $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$ * In this case, the *population* distribution and *sampling* distribution are the **same distribution**: Normal * The variance (or standard deviation) of the sampling distribution **decreases** as $n$ increases: A *larger sample* leads to a smaller, *more precise standard error*::: {.callout-note}* There is not a mathematically-derived sampling distribution for every population distribution and test statistic * If you're dealing with *unknown or unusual distributions or test statistics*, your only option may be *computational methods*:::## Computationally-derived sampling distribution### Getting started* Set a **random number seed** * If you don't do this, every time you run the code, you'll get different numbers * Setting a random seed makes your simulation *replicable* * (I normally set a seed at the start of each document, but I moved it down here because it's very important in this lab.)* Load (and install, if needed) the [**infer** package](https://infer.tidymodels.org/) * **infer** has a function that we'll use to *repeatedly sample* * There are other ways to do this, but this is very simple and works well* Only have to do this once per document```{r}set.seed(12345)library(infer)```### Standard Normal population: $population \sim \mathcal{N}(0, 1)$#### Create a population* `rnorm()` creates a random variable that follows a **normal** distribution * First argument = Number of observations (population size) * Second argument = $\mu$ * Third argument = $\sigma$ (not $\sigma^2$)```{r}norm_pop <-rnorm(1000000, 0, 1)mean(norm_pop)sd(norm_pop)var(norm_pop)```#### Sample from the population* Small sample: $n$ = 10 * `rep_sample_n()` repeatedly samples from the dataset * Piped (`%>%`) from the `norm_pop` dataset, so it resamples from that * `size = 10`: Sample size for each sample * `reps = 10000`: Number of repeated samples (could be larger but this is good enough to show how things work and doesn't take too long to run) * `replace = TRUE`: Sample with replacement * `summarise(x_bar = mean(norm_pop))`: Create a data frame with each of the means from each sample * The `ggplot()` code creates a histogram from the data * Modification: `bins = 50` increases the bins to make it a little smoother * Modification: `xlim(-1.5, 1.5)` sets limits on the x-axis to make the plots more comparable * Make similar modifications in your own plots later in this document * We'll spend the next 2 weeks on `ggplot()` so you'll learn all of this (and more!)```{r}norm_means10 <-data.frame(norm_pop) %>%rep_sample_n(size =10, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(norm_pop))ggplot(data = norm_means10, aes(x = 
x_bar)) +geom_histogram(bins =50) +xlim(-1.5, 1.5)```* Moderate sample: $n$ = 50```{r}norm_means50 <-data.frame(norm_pop) %>%rep_sample_n(size =50, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(norm_pop))ggplot(data = norm_means50, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(-1.5, 1.5) ```* Large sample: $n$ = 100```{r}norm_means100 <-data.frame(norm_pop) %>%rep_sample_n(size =100, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(norm_pop))ggplot(data = norm_means100, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(-1.5, 1.5)```#### Compare estimates* Compare means, standard deviations, and variances for each sample size```{r}mean(norm_means10$x_bar)mean(norm_means50$x_bar)mean(norm_means100$x_bar)``````{r}sd(norm_means10$x_bar)sd(norm_means50$x_bar)sd(norm_means100$x_bar)``````{r}var(norm_means10$x_bar)var(norm_means50$x_bar)var(norm_means100$x_bar)```* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution? * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$### Bernoulli population: $population \sim B(0.3)$#### Create a population* `rbinom()` creates a random variable that follows a **binomial** distribution * First argument = Number of observations (population size) * Second argument = number of trials (here, just 1 for a **Bernoulli** trial) * Third argument = $p$ (probability of success)```{r}bernoulli_pop <-rbinom(1000000, 1, 0.3)mean(bernoulli_pop)sd(bernoulli_pop)var(bernoulli_pop)```* Copy and modify the code above to resample from this new Bernoulli population#### Sample from the population* Small sample: $n$ = 10```{r}bern_means10 <-data.frame(bernoulli_pop) %>%rep_sample_n(size =10, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(bernoulli_pop))ggplot(data = bern_means10, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 1)```* Moderate sample: $n$ = 50```{r}bern_means50 <-data.frame(bernoulli_pop) %>%rep_sample_n(size =50, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(bernoulli_pop))ggplot(data = bern_means50, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 1)```* Large sample: $n$ = 100```{r}bern_means100 <-data.frame(bernoulli_pop) %>%rep_sample_n(size =100, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(bernoulli_pop))ggplot(data = bern_means100, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 1)```#### Compare estimates* Compare means, standard deviations, and variances for each sample size```{r}mean(bern_means10$x_bar)mean(bern_means50$x_bar)mean(bern_means100$x_bar)``````{r}sd(bern_means10$x_bar)sd(bern_means50$x_bar)sd(bern_means100$x_bar)``````{r}var(bern_means10$x_bar)var(bern_means50$x_bar)var(bern_means100$x_bar)```* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution **for a normally distributed population**? 
* $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$### Poisson population: $population \sim Poisson(1.5)$#### Create a population* `rpois()` creates a random variable that follows a **Poisson** distribution * First argument = Number of observations (population size) * Second argument = $\lambda$ (mean and variance) * Remember that $\lambda$ is the variance, but we're going to look at the **standard deviation** which is the *square root* of the variance```{r}poisson_pop <-rpois(1000000, 1.5)mean(poisson_pop)sd(poisson_pop)var(poisson_pop)```* Copy and modify the code above to resample from this new Poisson population#### Sample from the population* Small sample: $n$ = 10```{r}pois_means10 <-data.frame(poisson_pop) %>%rep_sample_n(size =10, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(poisson_pop))ggplot(data = pois_means10, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 4)```* Moderate sample: $n$ = 50```{r}pois_means50 <-data.frame(poisson_pop) %>%rep_sample_n(size =50, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(poisson_pop))ggplot(data = pois_means50, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 4)```* Large sample: $n$ = 100```{r}pois_means100 <-data.frame(poisson_pop) %>%rep_sample_n(size =100, reps =10000, replace =TRUE) %>%summarise(x_bar =mean(poisson_pop))ggplot(data = pois_means100, aes(x = x_bar)) +geom_histogram(bins =50) +xlim(0, 4)```#### Compare estimates* Compare means, standard deviations, and variances for each sample size```{r}mean(pois_means10$x_bar)mean(pois_means50$x_bar)mean(pois_means100$x_bar)``````{r}sd(pois_means10$x_bar)sd(pois_means50$x_bar)sd(pois_means100$x_bar)``````{r}var(pois_means10$x_bar)var(pois_means50$x_bar)var(pois_means100$x_bar)```* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution **for a normally distributed population**? * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$## Summarize results* In general, how does **sample size** impact the **standard error**? * i.e., the standard deviation of the sampling distribution?* Does it seem to matter what the population distribution is? * Based on: Visual inspection of the distribution * Based on: Estimates of standard error